using Plots, Images, Interact, ProgressMeter, LinearAlgebra, Statistics, Arpack
gr(
label="",
markerstrokewidth=0.5
)
imshow(img, args...; kwargs...) = heatmap(reverse(img; dims=1), showaxis=false, args...; kwargs...)
Let $A$ be an $m \times n$ matrix. The range or column space of $A$ denoted by $\mathcal{R}(A)$ is the subspace of vectors formed from all linear combinations of its columns $A[:,1], \ldots, A[:,n]$. Let $r$ denote the rank of $A$, i.e., the number of linearly independent columns.
We would like to find $r$ basis vectors $w_{1}, \ldots, w_{r}$ for $\mathcal{R}(A)$ and express each column of $A$ as a linear combination with respect to these basis vectors. Thus, for example, we might express the first column as
$$A[:,1] = w_{1} x_{11} + \cdots + w_{r} x_{r1},$$and the second column as
$$A[:,2] = w_{1} x_{12} + \cdots + w_{r} x_{r2},$$and so on.
In matrix-vector notation, this yields the matrix factorization
$$ A = W X,$$where $W$ is an $m \times r$ matrix whose columns are $w_1, \ldots, w_r$, $X$ is an $r \times n$ matrix, and
$$A[:,j] = w_{1} x_{1j} + \cdots + w_{r} x_{rj}.$$There are many ways of expressing a matrix $A$ in such a form. As we shall shortly see, factorizations that are learned from the data yield interesting insights and enable new applications.
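As a quick sanity check on this factorization, we can construct a rank-$r$ matrix from factors and confirm the column formula numerically (the sizes below are arbitrary choices for illustration):

```julia
using LinearAlgebra

m, n, r = 5, 4, 2
W = randn(m, r)   # r basis vectors for the column space
X = randn(r, n)   # coordinates of each column with respect to that basis
A = W * X         # a rank-r matrix by construction

# Column j of A is the combination w₁x₁ⱼ + ⋯ + w_r x_rⱼ:
for j in 1:n
    @assert A[:, j] ≈ sum(W[:, i] * X[i, j] for i in 1:r)
end
@assert rank(A) == r
```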
Let $e_i$ denote a sparse vector whose elements are all equal to zero except for the $i$-th element, which is equal to one.
Thus,
$$e_1 = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}^T,$$and so on.
In Julia:
function eᵢ(i, n)
e = zeros(n) ## avoid shadowing the function's own name
e[i] = 1
return e
end
returns $e_i$ for the input arguments i and n.
The $n$ dimensional vectors $e_{1}, \ldots, e_{n}$ form an elementary basis for $\mathbb{R}^n$ -- they are often referred to as the standard or Euclidean basis vectors.
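One payoff of this basis: multiplying $A$ by $e_j$ extracts the $j$-th column, $A e_j = A[:,j]$, and $e_i^T A$ extracts the $i$-th row. A quick check, reusing the eᵢ function from above:

```julia
using LinearAlgebra

function eᵢ(i, n)
    e = zeros(n)
    e[i] = 1
    return e
end

A = randn(4, 3)
@assert A * eᵢ(2, 3) == A[:, 2]    # A*eⱼ picks out column j
@assert eᵢ(3, 4)' * A == A[3, :]'  # eᵢᵀ*A picks out row i
# Expanding the columns of A in the standard basis recovers A exactly:
@assert A == sum(A[:, j] * eᵢ(j, 3)' for j in 1:3)
```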
The left and right singular vectors can be learned from $A$ via an optimization problem. This entails specifying the exact objective (also known as "cost" or "loss") function and the underlying constraints on the variable(s) being optimized. We begin by describing how we can learn the largest right singular vector.
Suppose we are given the matrix $A$. We consider a "best direction" problem that is formulated as follows: among all vectors $x$ constrained to have unit length or norm -- these are vectors on the unit hypersphere -- let $x_{\sf opt}$ denote the vector that maximizes the norm of the vector $Ax$.
Mathematically this is equivalent to the optimization problem
\begin{equation}\label{eq:manopt1} x_{\sf opt} = \textrm{arg max}_{||x||_2 = 1} \parallel A x \parallel_{2}, \end{equation}This problem can be solved analytically via the singular value decomposition. Let the SVD of $A$ be given by
$$A = \sum_{i=1}^{r} \sigma_{i} u_i v_i^T,$$where $\sigma_{1} \geq \ldots \geq \sigma_{r}$ are its singular values. Then
$$x_{\sf opt} = v_{1}.$$In other words, $x_{\sf opt}$ is the right singular vector of $A$ associated with its largest singular value. If there are multiple singular values equal to the largest singular value, then any associated singular vector or their associated linear combination (which is also a singular vector) will also be a solution.
Recall that if $v_i$ is a right singular vector of $A$ associated with the singular value $\sigma_i$ and the left singular vector $u_i$ then
$$A v_i = \sigma_i u_i.$$However, we also have that
$$A (-v_i) = \sigma_i (-u_i)$$Thus $-u_i$ and $-v_i$ are also candidate left and right singular vectors associated with the singular value $\sigma_i$ and there is an inherent sign ambiguity associated with the singular vectors. The svd command in Julia always returns the same U and V matrices for the same A. The svds command does not -- more on that later.
Many numerical optimization packages minimize functions. So to use them, we will have to recast our problem of finding $v_1$ as a minimization problem instead of a maximization problem.
This is straightforward: the $x$ that maximizes $||Ax||_2$ is precisely the $x$ that minimizes $-||Ax||_2$, and the same $x$ also minimizes $-||Ax||_2^2$, so either objective may be used. We now verify this numerically by solving the manifold optimization problem using the Optim.jl package and comparing the result to the theoretically predicted answer. To that end, we first define the function xthatMaximizes_norm_Ax_onSphere, which takes as its input the matrix A and returns as its output xopt, the numerical solution to Eq. (1).
using Optim, LinearAlgebra
function xthatMaximizes_norm_Ax_onSphere(
A::AbstractMatrix,
maxiters::Integer=1000,
x0::Vector = randn(size(A,2))
)
m, n = size(A)
x0 = normalize(x0) ## make it unit norm
opt = Optim.optimize(
x -> -norm(A*x, 2), x0, ## objective: minimize -‖Ax‖₂ on the sphere
Optim.ConjugateGradient(
manifold = Optim.Sphere()
),
Optim.Options(iterations = maxiters)
)
xopt = opt.minimizer
return xopt
end
Let us see whether the numerical solution matches the theoretical prediction.
A = randn(3, 3)
xopt = xthatMaximizes_norm_Ax_onSphere(A, 10)
U, s, V = svd(A)
u₁, v₁ = U[:,1], V[:,1] ## leading left and right singular vectors
@show xopt
@show v₁
@show v₁' * xopt
The inner product abs(v₁'xopt) is not exactly equal to one because of the termination criterion built into the solver. As we increase maxiters, the inner-product should get closer to one (so long as there is a unique $x$, aside from sign, that maximizes $||Ax||_2$). Of course, as the number of iterations is increased, the algorithm takes longer to converge.
Let us examine the effect of the number of iterations on the closeness between xopt and v₁ using the code in the next cell.
maxiters = [10, 20, 40, 80, 160, 240, 320, 500, 1000, 1250, 1500]
innerproduct = zeros(length(maxiters))
@showprogress for idx in 1:length(maxiters)
xopt = xthatMaximizes_norm_Ax_onSphere(A, maxiters[idx]);
innerproduct[idx] = (xopt'*v₁);
end
Let us plot the inner product thus computed as a function of the number of iterations.
plot(
maxiters,
innerproduct;
ylim=(-1.1, 1.1),
xlabel="maxiters",
ylabel="inner product",
marker=:circle
)
The plot looks wobbly because the inner product jumps from $-1$ to $1$. This illustrates that the numerical solver sometimes returns an answer close to -v₁, and other times close to v₁. So let's plot the absolute value in the next cell.
plot(
maxiters,
abs.(innerproduct),
ylim=(-1.1, 1.1),
label="abs. inner product",
xlabel="maxiters",
marker=:square,
color=:red,
)
Being able to recognize matrix-vector products is an important skill in data science. We now describe a radically new application enabled by the mathematics we have just developed.
We typically think of objects such as milk, white paper, eggshells, fog, and snow as opaque because we cannot see through them. These objects are composed of millions and billions of randomly distributed micro-scatterers which frustrate the passage of light by acting as a maze to any incident light. Changing the angle of the incident light does not affect the amount that gets through because every direction is equally bad.
Associated with every such scattering medium is a scattering matrix, which encodes how light from every incident angle gets scattered into outgoing light of every angle. It turns out that if we excite a wavefront corresponding to the largest right singular vector of this scattering matrix, then we can make the material transparent!
The singular value corresponding to this perfectly transmitting wavefront is nearly 1, while the singular values corresponding to most other wavefronts are close to zero, reflecting the near opacity of the medium. Recent research has revealed that "most" such opaque media composed of non-absorbing scatterers have at least one perfectly transmitting eigen-wavefront regardless of how deep the medium is. These wavefronts can be efficiently found and controlled; the role of the SVD in changing how we think about opaqueness is described here.
Since
$$||x^T A||_2 = ||A^T x||_2,$$and given that if $A = U \Sigma V^T$ then $A^T = V\, \Sigma^T U^T$, we can conclude, via an application of the previously derived result that
\begin{equation} u_1(A) = \arg \max_{||x||_2 = 1} ||x^T A||_2 \end{equation}We can numerically verify this via the next cell.
xopt = xthatMaximizes_norm_Ax_onSphere(A')
@show xopt' * u₁
xopt_alt = xthatMaximizes_norm_Ax_onSphere(-A')
@show xopt_alt' * u₁;
Note that we exploited the fact that $||x^TA||_2 = ||x^T(-A)||_2$. Being able to spot matrix-vector products is an important skill in computational data science.
We will now solve the manifold optimization problem
$$ x_{\sf opt} = \arg \max_{x} \parallel A\, x \parallel_{2} $$subject to the spherical constraint
$$ ||x||_{2}^{2} = 1, $$and the orthogonality constraint $x \perp v_{1}, \ldots, v_{k-1}$ (with the convention $v_{0} = 0$, so that the case $k=1$ is unconstrained).
It can be shown that
$$x_{\sf opt} = v_k.$$This provides a recipe for peeling off the singular vectors one at a time. We need to optimize over the set of vectors with unit norm that are orthogonal to the subspace spanned by the columns of the matrix
$$ V = \begin{bmatrix} v_{1} & \ldots & v_{k-1} \end{bmatrix} $$We will now reformulate this optimization problem into one involving maximization on the unit sphere.
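The reformulation relies on the orthogonal projector $P^\perp_V = I - VV^T$ (valid because the columns of $V$ are orthonormal): it annihilates anything in the span of $v_1, \ldots, v_{k-1}$ and leaves the orthogonal complement untouched, so maximizing $\|A P^\perp_V x\|_2$ over the sphere ignores the directions already found. A minimal check:

```julia
using LinearAlgebra

B = randn(4, 4)
V = svd(B).V[:, 1:2]        # first two right singular vectors (orthonormal columns)
P = I - V * V'              # projector onto the orthogonal complement of span(V)

@assert norm(P * V) < 1e-10 # span(V) is annihilated
@assert P * P ≈ P           # idempotent, as any projector must be
@assert P ≈ P'              # symmetric: an orthogonal projection
```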
We implement the function manoptV to do precisely this in the next cell.
using Optim
using LinearAlgebra
function manoptV(A::AbstractMatrix, k::Integer=minimum(size(A)), maxiters::Integer=1000)
m, n = size(A)
xopt = xthatMaximizes_norm_Ax_onSphere(A, maxiters)
V = xopt
for i in 2 : min(k, n - 1)
P_ortho_V = I - V * V'
xopt = xthatMaximizes_norm_Ax_onSphere(A * P_ortho_V, maxiters)
V = hcat(V, xopt)
end
if k == n
vn = (I - V * V') * randn(n)
vn = vn / norm(vn)
V = hcat(V, vn)
end
return V
end
Exercise:
Describe the algorithm in words and relate it to the appropriate optimization problem involving projection matrices.
If we maximize $||Ax||_2$ over unit vectors, the resulting $x$ is the leading right singular vector $v_1$. If we additionally constrain $x$ to lie in the orthogonal complement of $v_1$, the maximizer becomes $v_2$, i.e., $\arg\max ||A P^{\perp}_{\mathcal{R}(v_1)} x||_2 = v_2$. Similarly, $\arg\max ||A P^{\perp}_{\mathcal{R}(v_1, \ldots, v_k)} x||_2 = v_{k+1}$.
We now compute and display the matrix of inner products Vmanopt' * Vsvd. If the two sets of right singular vectors are identical, then we will get a $k \times k$ identity matrix.
k = 3 ## extract k singular vectors
Vmanopt = manoptV(A, k)
Usvd, ssvd, Vsvd = svd(A)
@show Vmanopt' * Vsvd
heatmap(Vmanopt' * Vsvd; yflip=true, ticks=[1, 2, 3])
The off-diagonal elements are on the order of 1e-8 or lower, comparable to the convergence tolerance of the numerical solver.
Because of the sign ambiguity of singular vectors, however, some diagonal entries of Vmanopt' * Vsvd may be close to $-1$ rather than $+1$. Consequently, the matrix abs.(Vmanopt' * Vsvd) will typically be close to the identity matrix even when Vmanopt' * Vsvd is not, as illustrated in the next cell.
heatmap(abs.(Vmanopt' * Vsvd); yflip=true, ticks=[1, 2, 3])
A similar result holds for the left singular vectors, which we obtain by applying manoptV to $A^T$ and verify numerically in the next cell.
Umanopt = manoptV(A',k)
heatmap(abs.(Usvd' * Umanopt); yflip=true, ticks=[1, 2, 3])
Julia and other languages leverage LAPACK for computing the eigenvectors and singular vectors -- that computation is not via the optimization procedure but using more "direct" methods. The optimization viewpoint is important, even if not computationally feasible for larger matrices, because it shows how we can learn the singular vectors by appropriately defining the loss function and the orthonormality constraints.
We now describe a matrix factorization into so-called principal components that utilizes the SVD.
Let $A$ be an $m \times n$ matrix whose columns $A[:,1], \ldots, A[:,n]$ denote $n$ measurements of an $m$ dimensional multi-variate dataset. Let $x$ be an $m \times 1$ unit norm vector.
Step 1: Note that the elements of the row vector
$$ x^T A = \begin{bmatrix} x^T A[:,1], & x^T A[:,2], & \ldots & x^T A[:,n] \end{bmatrix},$$can be interpreted as the coordinates of the columns of $A$ with respect to the basis vector $x$.
Step 2: In this viewpoint, the variance of the coordinates is given by
$$ \textrm{var}(x^T A) = \frac{1}{n}\sum_{i=1}^{n} \left(x^T A[:,i] -\frac{1}{n}\sum_{j} x^T A[:,j]\right)^2,$$which can be rewritten as
\begin{equation} \textrm{var}(x^T A) = \frac{1}{n} \sum_{i=1}^{n} \left[x^T \left(A[:,i] -\frac{1}{n}\sum_{j} A[:,j]\right) \right]^2. \end{equation}Step 3: Change of variables.
Let
$$ \overline{\mu}_A = \frac{1}{n} \sum_{i} A[:,i] = \frac{1}{n} A\,\mathbf{1},$$be the mean (column) vector. Then we have that
$$ \textrm{var}(x^T A) = \frac{1}{n} \sum_{i=1}^{n} \left[x^T \left(A[:,i] -\overline{\mu}_A\right)\right]^2.$$Let the centered data matrix $\overline{A}$ be defined as
$$\overline{A} = A - \overline{\mu}_A \mathbf{1}^T = A - \frac{1}{n} A \mathbf{1}\mathbf{1}^T.$$Then the variance along the direction $x$ is given by
\begin{equation} \textrm{var}(x^T A) = \frac{1}{n} \parallel x^T \overline{A} \parallel_2^{2}. \end{equation}Step 4: Putting it all together, we see that the problem of finding the "direction that maximizes variance" can be cast as the optimization problem
\begin{equation} x_{\sf opt} = \arg \max \textrm{var}(x^T A) = \arg \max \parallel x^T \overline{A} \parallel_2^{2}, \end{equation}subject to $\parallel x \parallel_2 = 1$. (The $\frac{1}{n}$ factor does not affect the maximizer, so we may drop it.)
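The centering in Step 3 is worth checking numerically: the rank-one update $A - \frac{1}{n}A\mathbf{1}\mathbf{1}^T$ is identical to broadcasting away the row-wise mean, and it makes every row of $\overline{A}$ sum to zero. A minimal sketch:

```julia
using LinearAlgebra, Statistics

m, n = 3, 50
A = randn(m, n)
ones_n = ones(n)

Abar = A - (1/n) * A * ones_n * ones_n'  # centering via the rank-one update
@assert Abar ≈ A .- mean(A, dims=2)      # identical to broadcasting the mean away
@assert norm(Abar * ones_n) < 1e-10      # each row of Abar now sums to zero
```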
Next, we connect this to the manifold optimization problems considered earlier.
This direction is the so-called Principal Component, and applying what we derived earlier, we have that
$$x_{\sf opt} = u_{1}(\overline{A}).$$The so-called Principal Components are thus simply the left singular vectors of the centered data matrix $\overline{A}$. The function manoptPCs in the next cell computes the $k$ leading principal components.
using Optim, LinearAlgebra
using Statistics:mean
function manoptPCs(A::AbstractMatrix, k::Integer, maxiters::Integer=1000)
m, n = size(A)
centeredA = A .- mean(A, dims = 2)
return manoptV(centeredA', k, maxiters)
end
The covariance between two jointly distributed real-valued random variables $x$ and $y$ with finite second moments is defined as
$$\textrm{cov}(x,y) = E[xy] − E[x]E[y].$$The variables are considered to be uncorrelated if their covariance is zero. Note that $V[x] = E[x^2] - (E[x])^2$ is the variance of $x$ and will be greater than zero unless $x$ is a constant, non-random variable.
The cross covariance matrix of two vector valued random variables $x$ and $y$ is the matrix defined (in "dot" notation) as
\begin{equation} \textrm{cov}(x,y) = E.[xy^T] - E.[x] E.[y]^T. \end{equation}The covariance matrix of a vector $x$ is the matrix of pair-wise covariances of the elements of $x$ and is defined as
\begin{equation} \Sigma_{x} = \textrm{cov}(x,x) = E.[xx^T] - E.[x] E.[x]^T. \end{equation}The elements of $x$ are said to be uncorrelated if the covariance matrix $\Sigma_x$ is diagonal.
The sample covariance matrix of a matrix $X$, each of whose rows represents a random variable and whose $n$ columns represent samples (or measurements), is given by
\begin{equation} \widehat{\Sigma}_{X} = \dfrac{1}{n} XX^T - \mu_X \mu_X^T, \end{equation}where
$$\mu_X = \dfrac{1}{n} X \mathbf{1},$$is the mean column vector of $X$.
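This formula can be checked against the cov function in Julia's Statistics standard library (with corrected=false for the $1/n$ normalization, and dims=2 since the samples are columns):

```julia
using LinearAlgebra, Statistics

n = 200
X = randn(3, n)
μX = (1/n) * X * ones(n)          # mean column vector
Σhat = (1/n) * X * X' - μX * μX'  # sample covariance, as defined above

@assert Σhat ≈ cov(X; dims=2, corrected=false)
@assert Σhat ≈ Σhat'              # covariance matrices are symmetric
```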
Consequently, when the $n$ columns of $A$ are drawn from a multi-variate normal distribution with an arbitrary mean and covariance $\Sigma_{aa}$, then in the limit of $n\to \infty$ we expect
$$ \frac{1}{n} \overline{A}\,\overline{A}^T \to \Sigma_{aa},$$so that the principal components will approximately equal the eigenvectors of the covariance matrix $\Sigma_{aa}$.
We illustrate this next for samples of a multivariate normal with zero mean and diagonal covariance.
n = 2500 # number of samples
CovarianceA = Diagonal([1, 2])
# Columns are Normal(0, CovarianceA) distributed:
A = sqrt(CovarianceA)*randn(2, n)
@show A * A' / n; # approximates CovarianceA
We will use the following plotting function to illustrate data (as a scatter plot) along with principal components (as arrows):
"""
p = scatter_quiver(A, Upc)

Given a 2-by-n matrix A, plot the second-row coordinates against
the first-row coordinates, then add quiver arrows along the
columns of Upc.
"""
function scatter_quiver(A::Matrix, Upc::Matrix; title="PC1 (red) & PC2 (black)", alpha=0.2)
p = scatter(
A[1, :], A[2, :],
alpha=alpha,
aspect_ratio=:equal,
)
quiver!(
p,
[0, 0], [0, 0],
title=title,
quivers=[(Upc[1, 1], Upc[2, 1]), (Upc[1, 2], Upc[2, 2])], ## arrows along the columns of Upc
color=[:red, :black]
)
return p
end
We now compute the principal components and plot them on the point cloud. Note that they should be approximately equal to $e_2$ and $e_1$, in that order.
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc; alpha=0.15)
We now replace the multivariate normal distribution with the uniform distribution and see that the Principal Components are still close to $e_1$ and $e_2$, provided $n$ is large enough (relative to the dimension).
# centered A:
A = sqrt(CovarianceA) * (rand(2, n) - 0.5 * ones(2, n)) * sqrt(12)
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc)
We wish to express the matrix $Y$ as $Y = W_{\sf pca} X_{\sf pca}$, where the columns of $W_{\sf pca}$ point along the principal component directions. The rows of $X_{\sf pca}$ are the principal coordinates.
Step 1:
We first express $Y$ as the sum of a mean matrix $\overline{Y}$ and a centered matrix $\widetilde{Y}$ whose columns average to the zero vector, such that
$$Y = \overline{Y} + \widetilde{Y},$$where
$$ \overline{Y} = \mu_Y \mathbf{1}^T,$$and
$$\mu_Y = \dfrac{1}{n} \sum_{i=1}^{n} Y[:,i].$$Step 2:
We then express $\widetilde{Y}$ via its SVD as $$ \widetilde{Y} = U \Sigma V^H.$$
Step 3:
With the $U$ thus returned by PCA, we can decompose
$$ Y=W_{\sf pca} X_{\sf pca},$$as below.
This yields the PCA factorization (or decomposition) of a data matrix $Y$
Exercise:
Write the function pca_factorization which takes as its input the matrix Y and returns as its output Wpca and Xpca.
"""
Wpca, Xpca = pca_factorization(Y)
* Input: Data matrix Y
* Output: Factor matrices Wpca and Xpca such that Y = Wpca*Xpca
"""
using LinearAlgebra
using Statistics: mean
function pca_factorization(Y::AbstractMatrix)
y_mean = mean(Y; dims=2)
Ymean = y_mean * ones(1, size(Y, 2))
Ytil = Y - Ymean
U, s, V = svd(Ytil, full=false)
Wpca = U * Diagonal(s) ## columns point along the principal component directions
Winv = inv(Wpca)
Xpca = Winv * Ymean + V' ## so that Wpca * Xpca = Ymean + U * Diagonal(s) * V' = Y
return Wpca, Xpca
end
Let us verify that it is indeed a factorization using the code in the next cell. If the error is small then it is a factorization, otherwise it is not.
Y = randn(2, 1000)
Wpca, Xpca = pca_factorization(Y)
factorization_error = norm(Y - Wpca * Xpca)
println("Matrix factorization error is $(factorization_error)")
We now illustrate how the PCA factorization of a data matrix can provide interesting insights on the data, via the principal coordinates.
To that end, we begin by loading a dataset as in the next cell.
using CSV, DataFrames
Xdata = CSV.read("breast-cancer-wisconsin.data.txt", DataFrame)
The table Xdata thus loaded is a $699 \times 11$ DataFrame. The rows represent different patients ($=$ "samples");
the columns represent different attributes given below.
| Column | Attribute | Type of data |
|---|---|---|
| 1. | Sample code number | id number |
| 2. | Clump Thickness | 1 - 10 |
| 3. | Uniformity of Cell Size | 1 - 10 |
| 4. | Uniformity of Cell Shape | 1 - 10 |
| 5. | Marginal Adhesion | 1 - 10 |
| 6. | Single Epithelial Cell Size | 1 - 10 |
| 7. | Bare Nuclei | 1 - 10 |
| 8. | Bland Chromatin | 1 - 10 |
| 9. | Normal Nucleoli | 1 - 10 |
| 10. | Mitoses | 1 - 10 |
| 11. | Class | (2 for benign, 4 for malignant) |
You can read more about the data here.
Columns 2-10 of this dataset contain the data we will analyze. Column 11 contains label information (whether a cell was "benign" or "malignant") that we will not use in the analysis but will use in the visualization of the analyzed data. To that end, we create a $9 \times 699$ matrix A_breastcancer whose rows represent the various features and whose columns represent different samples (associated with different patients). Some samples correspond to benign cells while others correspond to malignant cells, so we extract this label information for later use as below.
A_breastcancer = Matrix(Xdata[:,2:10])'
benign = Xdata[:,end] .== 2
println("Number of benign samples = $(sum(benign))");
malignant = Xdata[:,end] .== 4;
println("Number of malignant samples = $(sum(malignant))");
We will now visualize the data in this matrix and ascertain from the visualization whether we can tell apart features corresponding to the benign cells or the malignant cells.
To be able to display the data in two dimensions, we create a scatter plot of A_breastcancer[idx1,:] versus A_breastcancer[idx2,:] and vary idx1 and idx2.
Thus, for example, when idx1 = 1 and idx2 = 2, we are plotting the value of the "Clump Thickness" variable against the value of the "Uniformity of Cell Size" variable for each patient.
We will label the "benign" and "malignant" samples in the visualization as illustrated next.
using Interact
@manipulate for idx1 = (1, 2, 3, 4, 5, 6, 7, 8, 9), idx2 = (2, 3, 4, 5, 6, 7, 8, 9, 1)
scatter(A_breastcancer[idx1,benign],A_breastcancer[idx2,benign], marker=:square, color=:red, label = "benign",
legend =:bottomright)
scatter!(A_breastcancer[idx1,malignant],A_breastcancer[idx2,malignant], color=:blue, label = "malignant")
plot!(; xlabel = "Elements of row $idx1", ylabel = "Elements of row $idx2")
end
Thus projecting the 9-dimensional data onto the canonical Euclidean coordinates corresponding to idx1 and idx2 does not yield any insights.
We now consider the PCA factorization of the same matrix and generate a scatter plot using the first and second principal coordinates.
Wpca, Xpca = pca_factorization(A_breastcancer)
scatter(Xpca[1,benign],Xpca[2,benign],
color=:red, marker=:square, label = "benign", legend=:bottomright,
xlabel = "PCA Coordinate 1", ylabel = "PCA Coordinate 2")
p_pca = scatter!(Xpca[1,malignant],Xpca[2,malignant], color=:blue, label = "malignant")
plot!(; title = "PCA")
Thus the PCA factorization provides insights that the canonical factorization does not!
PCA works well in many settings and is a workhorse algorithm for many machine learning and data science applications.
There are situations where a PCA matrix factorization does not yield insights but other factorizations do. We now consider one such factorization.
To motivate ICA, we perform PCA on a data matrix whose columns are drawn from a multivariate normal distribution with an identity covariance matrix.
In this setup, any $2 \times 2$ orthogonal matrix is a valid eigenvector matrix of the covariance. Consequently, the principal components will be essentially random, since every direction captures (nearly) the same variance.
For a particular realization of $A$ there will be a "best direction" corresponding to the leading left singular vector of $\overline{A}$. For different realizations of $A$, however, the answer will change as we shall see next.
n = 1000
A = randn(2, 1000)
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc; title="PCs: Trial 1")
n = 1000
A = randn(2, 1000)
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc; title="PCs: Trial 2")
Finally, we conclude with an example of a non-Gaussian distribution with identity covariance, for which PCA likewise yields random directions.
n = 1000
A = (rand(2, n) - 0.5 * ones(2, n)) * sqrt(12)
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc; title="PCs: Trial 1")
n = 1000
A = (rand(2, n) - 0.5 * ones(2,n)) * sqrt(12)
Upc = manoptPCs(A, 2)
scatter_quiver(A, Upc; title="PCs: Trial 2")
We might wonder if, in the above two examples, it is possible to recover the directions aligned with the axes. The geometry of the point cloud is non-isotropic, unlike in the Gaussian setting where it appears isotropic. It turns out that we can, using a technique called Independent Component Analysis (ICA), which we describe next.
As before, let $\sum_{i} A[:,i] = 0$ and furthermore, assume that
$$ \dfrac{1}{n} AA^T = I.$$Consider the manifold optimization problem
$$ x_{\sf opt} = \textrm{arg max} |\kappa_4(x^T A)| ,$$subject to the spherical constraint
$$ \| x \|_2^{2} = 1,$$where, for $y \in \mathbb{R}^{n}$, $\kappa_4(y^T)$ is the fourth central cumulant or kurtosis and is defined as
$$\kappa_4(y^T) = \dfrac{1}{n} \sum_i y_i^4 - 3 \left[ \dfrac{1}{n} \sum_i y_i^2 \right]^2.$$This problem cannot be solved in closed form, but can be solved numerically as coded up in the next cell.
Exercise:
Write a function called kurtosis which takes as its input the vector x and the matrix A and returns as its output the kurtosis of the vector $x^H \,A$.
using Statistics: mean
kurtosis(x::Vector, A::AbstractMatrix) = mean(t -> t^4, x' * A) - 3 * mean(t -> t^2, x' * A)^2
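For intuition about the values $\kappa_4$ takes: it is approximately zero for Gaussian samples, negative for "flat" sub-Gaussian distributions such as the uniform, and positive for heavy-tailed ones. For a unit-variance uniform distribution, $E[y^4] = 9/5$, so $\kappa_4 = 9/5 - 3 = -1.2$. A sampling check (tolerances are loose since these are empirical estimates):

```julia
using Statistics

# Empirical fourth cumulant of a zero-mean sample vector:
κ₄(y) = mean(y .^ 4) - 3 * mean(y .^ 2)^2

n = 200_000
y_gauss   = randn(n)                      # Gaussian: κ₄ ≈ 0
y_uniform = sqrt(12) .* (rand(n) .- 0.5)  # unit-variance uniform: κ₄ ≈ -1.2

@assert abs(κ₄(y_gauss)) < 0.2
@assert abs(κ₄(y_uniform) + 1.2) < 0.2
```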
Thus while PCA finds directions that maximize the variance, ICA will find directions that maximize the absolute kurtosis. You can see this in the appropriate section of the code below.
Exercise:
Write a function called xthatMaximizes_kurt_xTA which takes as its inputs the matrix A, the number of iterations maxiters, and an initial starting vector x0 and returns as its output the unit-norm $x$ that maximizes the absolute kurtosis of $x^H A$.
Recall that the optimizer in the Optim package minimizes the objective function.
using Optim
using Statistics:mean
using LinearAlgebra
kurtosis(x::Vector, A) = mean(t -> t^4, x' * A) - 3 * mean(t -> t^2, x' * A)^2
function xthatMaximizes_kurt_xTA_onSphere(
A::AbstractMatrix,
maxiters::Integer=1000,
x0::Vector = randn(size(A,1))
)
m, n = size(A)
x0 = normalize(x0)
opt = Optim.optimize(
x -> -abs(kurtosis(x, A)), x0, ## maximize |κ₄| by minimizing its negative
Optim.ConjugateGradient(
manifold=Optim.Sphere()
),
Optim.Options(iterations=maxiters)
)
xopt = opt.minimizer
return xopt
end
The function below returns several independent components in a manner that is analogous to manoptPCs.
function manoptICs(A::AbstractMatrix, k::Integer=minimum(size(A)), maxiters::Integer=1000)
m, n = size(A)
xopt = xthatMaximizes_kurt_xTA_onSphere(A, maxiters)
U = xopt
for i in 2:min(k, m - 1)
P_ortho_U = I - U * U'
xopt = xthatMaximizes_kurt_xTA_onSphere(P_ortho_U * A, maxiters)
U = hcat(U, xopt)
end
if k == m
um = (I - U * U') * randn(m)
um = um / norm(um)
U = hcat(U, um)
end
return U
end
Exercise:
-- What is the optimization problem that yields $U[:,i]$ above (not to be confused with the $U$ of the SVD) as a solution?
-- What is the role played by the P_ortho matrix relative to the optimization problem?
Let $\kappa_4(x)$ denote the kurtosis of a random variable $x$. Then
$$u_{\sf ica, i} = \arg \max_x |\kappa_4(x^T P^\perp_{\mathcal{R}(u_1, \ldots, u_{i-1})} A)|,$$subject to
$$ ||x||_2^2 = 1.$$We run ICA on the dataset for which PCA seemingly failed, and then illustrate that the directions it produces are interestingly different from those PCA produced.
@show Uic = manoptICs(A, 2);
println("Sum of the absolute entries of Uic equals $(sum(abs.(Uic)))")
scatter_quiver(A, Uic; title="ICs")
n = 10000 # number of samples
Q = qr(randn(2, 2)).Q ## a random 2×2 orthogonal matrix
println("Q = $Q")
A = Q * sqrt(12) * (rand(2, n) .- 0.5)
Upc = manoptPCs(A, 2)
println("Upca = $Upc")
Uic= manoptICs(A, 2)
println("Uica = $Uic")
We now compare PCA versus ICA.
plot(
scatter_quiver(A[:,1:10:end], Upc; title="PCA"),
scatter_quiver(A[:,1:10:end], Uic; title="ICA")
)
Exercise:
We claim that ICA recovers a direction that aligns with the edges of the square, even as PCA does not.
Do you agree? Run the experiment above multiple times and remark on what you notice about the ICA recovered bases relative to what PCA recovers.
Yes, ICA recovers a direction that aligns with the edges of the square most of the time, whereas PCA typically fails to do so.
Our claim is based on the fact that ICA is designed to unmix independent signals, and so the ICA factorization yields as-independent-as-possible coordinates. In contrast, PCA yields uncorrelated coordinates. Independent coordinates are uncorrelated, but uncorrelated coordinates are not necessarily independent -- this is why ICA can yield additional insights in settings where PCA does not.
We will delve into this much further in a subsequent codex. For now, we will proceed with ICA as a black box and develop the analog of the PCA factorization, and see what we can do with ICA that we could not with PCA.
Our goal is to decompose $Y$ as
$$ Y = W_{\sf ica} X_{\sf ica},$$where the rows of $X_{\sf ica}$ are independent (or at least "as independent as possible"). This is in contrast to PCA, where the analogous $X_{\sf pca}$ has rows that are uncorrelated rather than independent.
We will choose a normalization so that $X_{\sf ica}$ is "white" -- in other words $X_{\sf ica} X_{\sf ica}^H = I$.
Step 1:
To that end, we first compute the (rank one) matrix whose columns all equal the column-wise mean (the average of the columns) of $Y$:
$$\overline{Y} = \mu_Y \mathbf{1}^T,$$where
$$\mu_Y = \dfrac{1}{n} \sum_{i} Y[:,i],$$and set $\widetilde{Y} = Y - \overline{Y}$.
Step 2:
We now express $\widetilde{Y}$ via the SVD as
$$\widetilde{Y} = U \Sigma V^H,$$and apply ICA on $V^H$ to obtain the orthogonal (or unitary) matrix $Q_{\sf ica}$ that decomposes $V^H$ as
$$V^H = Q_{\sf ica} V_{\sf ica}^H.$$The $V_{\sf ica}^H$ thus yielded has rows that are as independent as possible. (We will delve into greater detail of ICA in a subsequent codex).
Step 3:
With the $Q_{\sf ica}$ returned by ICA we can compute $V_{\sf ica}$ and decompose $Y$ as
$$Y = W_{\sf ica} X_{\sf ica},$$as below.
We have derived the ICA factorization (or decomposition) of a data matrix $Y$.
Exercise: Complete the code below
using LinearAlgebra
using Statistics: mean
function ica_factorization(Y::AbstractMatrix)
μy = mean(Y; dims = 2)
Ymean = μy * ones(1, size(Y, 2))
Ytil = Y - Ymean
U, s, V = svd(Ytil)
S = Diagonal(s)
Qica = manoptICs(sqrt(size(V, 1)) * V', size(Y, 1)) ## ICA on the whitened right singular vectors
Vica = V * Qica ## V' = Qica * Vica' since Qica is orthogonal
Wica = U * S * Qica
Xica = inv(Wica) * Ymean + Vica' ## so that Wica * Xica = Ymean + U * S * V' = Y
return Wica, Xica
end
We now perform an ICA factorization of the matrix $Y$ from earlier and check that $Y = W_{\sf ica} X_{\sf ica}$.
Wica, Xica = ica_factorization(Y)
factorization_error = norm(Y - Wica * Xica)
println("Matrix factorization error is $(factorization_error)")
Because the factorization error is small, we have factorized the matrix exactly (to numerical precision); a large error would indicate that we do not have a valid factorization.
Note that the ICA factorization is not unique because there are many ways of generating an $X_{\sf ica}$ that has independent components. For example, let $D$ be an arbitrary (invertible) diagonal matrix. Then, note that we have that
$$Y = (W_{\sf ica} D) (D^{-1} X_{\sf ica}),$$and the matrix $D^{-1} X_{\sf ica}$ will also have independent components, except that row $i$ is rescaled by $1/d_i$.
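This scaling ambiguity is easy to confirm numerically: for any invertible diagonal $D$, the rescaled pair reproduces $Y$ exactly. Here W and X are arbitrary stand-in factors, not an actual ICA output:

```julia
using LinearAlgebra

W = randn(3, 3)
X = randn(3, 10)
Y = W * X

D = Diagonal([2.0, -0.5, 3.0])      # an arbitrary invertible diagonal matrix
@assert (W * D) * (inv(D) * X) ≈ Y  # same product, differently scaled factors
```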
The ICA factorization that we have derived above enforces the constraint that $X_{\sf ica} X_{\sf ica}^H = I$ -- here, too, the same non-uniqueness property holds if we pick an arbitrary $D = \textrm{diag}(\pm 1, \ldots, \pm 1)$.
We now use ICA to factorize the breast cancer dataset, from earlier.
Wica, Xica = ica_factorization(A_breastcancer)
benign = Xdata[:,end] .== 2
scatter(Xica[1,benign],Xica[2,benign],
color=:red, marker=:square, label = "benign", legend=:bottomright,
xlabel = "Independent Coordinate 1", ylabel = "Independent Coordinate 2")
malignant = Xdata[:,end] .== 4
p_ica = scatter!(Xica[1,malignant],Xica[2,malignant], color=:blue, label = "malignant")
plot!(; title = "ICA")
We now compare PCA to ICA.
plot(p_pca, p_ica, layout = (1,2), size = (800, 400))
Exercise:
-- What is the dfference between the Principal Coordinates and the Independent Coordinates?
-- Which one provies better discrimination between the benign and malignant samples?
The Principal Coordinates are coordinates along the directions of maximum variance, whereas the Independent Coordinates are along directions chosen to make the coordinates as independent as possible.
ICA provides better discrimination between the benign and malignant samples.
We now illustrate additional applications of ICA.
using Images
image1 = "images/skyline1.jpeg"
load(image1)
image1 = "images/skyline1.jpeg"
image2 = "images/skyline2.jpeg"
I1 = Float64.(Gray.(load(image1)))
sizeI1 = size(I1)
I2 = Float64.(Gray.(load(image2)))
p1 = imshow(I1, color=:grays, aspect_ratio=:equal, title="I1")
p2 = imshow(I2, color=:grays, aspect_ratio=:equal, title="I2")
plot(p1, p2, size=(900, 250))
We now convert the images into vectors $s_1$ and $s_2$ and then form the matrix
$$ S = \begin{bmatrix} s_1^T \\ s_2^T \end{bmatrix},$$and for an arbitrary $A$ matrix produce from $S$ the mixed variables
$$Y = A S,$$as in the next code cell.
# run this cell only once
s1 = vec(I1)
s2 = vec(I2);
Julia does not have an analog of MATLAB's clear function; once a name is defined in a Julia session (technically, in module Main), it is always present. The images we are loading here are rather large, and thus the memory consumed by the variable I1 is large. We can free the memory used by I1 by reassigning the variable, as in the next cell.
I1 = 0.0
I2 = 0.0 ## this clears I1 and I2 from memory
We now mix the images together and display the mixed images -- the original images should be (nearly) imperceptible in the mixtures!
A = [0.5 0.5; 0.5 -0.5];
S = [s1 s2]';
Y = A * S;
S = 0.0; # clear S variable
mixed1 = reshape(Y[1, :] / maximum(Y[1, :]), sizeI1)
mixed2 = reshape(Y[2, :] / maximum(Y[2, :]), sizeI1)
p3 = imshow(
mixed1,
color=:grays,
aspect_ratio=:equal,
title="mixed image 1"
)
p4 = imshow(
mixed2,
color=:grays,
aspect_ratio=:equal,
title="mixed image 2 "
)
plot(p3, p4; size=(900, 250))
We then apply the ICA factorization to the mixed variables -- this yields $W_{\sf ica}$ and $S_{\sf ica}$. We display the rows of $S_{\sf ica}$ as images. If the images thus displayed match the original images, then we have succeeded in unmixing the variables.
mixed1 = 0.0
mixed2 = 0.0 # clear mixed1 and mixed2 variables
Wica, Sica = ica_factorization(Y)
@show pinv(Wica) * A
Sica = abs.(Sica) ## make sign positive since it's an image
# Y = 0.0; # clear Y variable
unmixed1 = reshape(Sica[1,:] / maximum(Sica[1, :]), sizeI1)
unmixed2 = reshape(Sica[2,:] / maximum(Sica[2, :]), sizeI1)
Sica = 0.0; # clear Sica variable
p5 = imshow(unmixed1; color=:grays, aspect_ratio=:equal, title="Unmixed Image 1")
p6 = imshow(unmixed2; color=:grays, aspect_ratio=:equal, title="Unmixed Image 2")
unmixed1 = 0.0
unmixed2 = 0.0
plot(p3, p4, p5, p6; layout=4, size=(900, 600))
We now examine one of the unmixed images in greater detail and compare it to the original image.
plot(p1, p6, size = (900, 300))
Exercise:
Comment on how well ICA unmixes the images.
Are there regions within the image where the unmixed image is imperfect?
ICA unmixes the images very well, although the final intensities differ slightly between the original and the unmixed images. Some of the shadows from the buildings in the other image are still visible.
##TODO: Unmix using PCA -- show both mixed images versus both unmixed images
mixed1 = 0.0
mixed2 = 0.0 # clear mixed1 and mixed2 variables
Wpca, Spca = pca_factorization(Y)
@show pinv(Wpca) * A
Spca = abs.(Spca) ## make sign positive since it's an image
# Y = 0.0; # clear Y variable
unmixed1 = reshape(Spca[1,:] / maximum(Spca[1, :]), sizeI1)
unmixed2 = reshape(Spca[2,:] / maximum(Spca[2, :]), sizeI1)
Spca = 0.0; # clear Spca variable
p5 = imshow(unmixed1; color=:grays, aspect_ratio=:equal, title="Unmixed Image 1")
p6 = imshow(unmixed2; color=:grays, aspect_ratio=:equal, title="Unmixed Image 2")
unmixed1 = 0.0
unmixed2 = 0.0
plot(p3, p4, p5, p6; layout=4, size=(900, 600))
##TODO: Unmix using PCA -- show one of the unmixed images and compare it to the original
plot(p1, p6, size = (900,300))
Exercise:
Comment on how well PCA unmixes the images.
Are there regions within the image where the unmixed image is imperfect?
PCA doesn't unmix the images very well. There is still a lot of residual from the other image.
We will now use ICA to unmix audio signals. Such a situation arises when, for example, we have two microphones recording the conversations of two speakers. Note that we need as many mixed recordings as there are sources in order to unmix the original recordings. Before we begin, navigate to the directory with the audio recordings and play them so you know what the original recordings sound like.
using WAV
# create mixed sources & save as .wav file
source_num = [2 5]; # pick two -- from 1, 2, 3, 4 ,5
# try different sources here!
source1 = "audio/source$(source_num[1]).wav"
source2 = "audio/source$(source_num[2]).wav"
s1, fs = wavread(source1)
s2, fs2 = wavread(source2)
min_s1s2_size = min(length(s1), length(s2))
# strip to same length
s1 = s1[1:min_s1s2_size]
s2 = s2[1:min_s1s2_size]
S = [s1 s2]'
We now mix the recordings together and save the mixed recordings as .wav files. Navigate to the directory where these recordings are saved and listen to them so you can hear how they sound and confirm that we have indeed mixed them!
A = [2 1.5; 1.5 2]; # mixing matrix
# Try different A matrices!
Y = A * S # mix the signals together
wavwrite(Y[1, :], "mixed1.wav", Fs=fs)
wavwrite(Y[2, :], "mixed2.wav", Fs=fs)
## open and play the .wav file in the directory
Audio of first mixed waveform
We now apply ICA to the mixed recordings and save the unmixed recordings as .wav files -- play these files back to hear how well we have succeeded (or not). Then try it with different $A$ matrices.
Sica = ica_factorization(Y)[2]
invSica = (1.0 ./ maximum(abs.(Sica); dims=2))[:, 1] ## normalize to same peak amplitude
Sica = Diagonal(invSica) * Sica
wavwrite(Sica[1, :], "ica_unmixed1.wav", Fs=fs)
wavwrite(Sica[2, :], "ica_unmixed2.wav", Fs=fs)
s1, s2 = 0, 0
plot_ica = plot(plot(Sica[1,:], color =:blue, label = "", xlabel = "t", ylabel = "s1ica(t)"),
plot(Sica[2,:], color =:blue, label = "", xlabel = "t", ylabel = "s2ica(t)"),
layout = (2,1), title = "ICA")
plot_true = plot(plot(S[1 ,:], color=:black, xlabel = "t", ylabel = "s1(t)", title = "Original signals"),
plot(S[2,:], color=:black, xlabel = "t", ylabel = "s2(t)") ,
layout = (2,1))
plot(plot_true,plot_ica, layout = (1,2), size = (900, 400))
We now listen to the audio to confirm that they were indeed recovered.
Audio of the first ICA unmixed waveform
Audio of the second ICA unmixed waveform
##TODO: Unmix using PCA -- your code here
Spca = pca_factorization(Y)[2]
invSpca = (1.0 ./ maximum(abs.(Spca), dims = 2))[:, 1] ## normalize to same peak amplitude
Spca = Diagonal(invSpca) * Spca
wavwrite(Spca[1, :], "pca_unmixed1.wav", Fs=fs)
wavwrite(Spca[2, :], "pca_unmixed2.wav", Fs=fs)
Y = 0
plot_pca = plot(plot(Spca[1,:], color =:green, label = "", xlabel = "t", ylabel = "s1pca(t)"),
plot(Spca[2,:], color =:green, label = "", xlabel = "t", ylabel = "s2pca(t)"),
layout = (2,1), title = "PCA")
plot_true = plot(plot(S[1 ,:], color=:black, xlabel = "t", ylabel = "s1(t)", title = "Original signals"),
plot(S[2,:], color=:black, xlabel = "t", ylabel = "s2(t)") ,
layout = (2,1))
plot(plot_true,plot_pca, layout = (1,2), size = (900, 400))
plot(plot_true,plot_ica, plot_pca, layout = (1,3), size = (900, 400))
Exercise:
Comment on how the plots of the waveforms recovered by PCA and ICA reveal which one succeeds in unmixing the mixed audio sources.
Both PCA and ICA do not produce waveforms that exactly match the waveforms of the original audio signals that were mixed together.
We now compare the unmixed audio -- this is the ultimate test!
Audio of the first PCA unmixed waveform
Audio of the second PCA unmixed waveform
Exercise: How does the PCA unmixed audio sound?
The two audio signals still sound mixed together; hence PCA does not appear to be unmixing the signals very well.
We now mix waveforms together and apply ICA to the mixed waveforms.
sawtooth(t, width=0.5) = mod(t, width)
square(t, width=0.5) = sign(sin(t / width))
t = -5:0.01:5
A = [1 0.5; 1 1]
@manipulate for s1type in ["sin", "square", "sawtooth"],
s2type in ["cos", "square", "sawtooth"]
if s1type == "sin"
s1 = sin.(2 * t)
elseif s1type == "square"
s1 = square.(t, 1.25)
elseif s1type == "sawtooth"
s1 = sawtooth.(t, 2.3)
end
if s2type == "cos"
s2 = cos.(3 * t)
elseif s2type == "square"
s2 = square.(t)
elseif s2type == "sawtooth"
s2 = sawtooth.(t)
end
S = [s1 s2]'
Y = A * S
Sica = ica_factorization(Y)[2]
p1 = plot(t,S[1,:]; color=:red, title="Signal 1")
p2 = plot(t,S[2,:]; color=:blue, title="Signal 2")
p3 = plot(t,Y[1,:]; color=:red, title="Mixed Signal 1")
p4 = plot(t,Y[2,:]; color=:blue, title="Mixed Signal 2")
p5 = plot(t,Sica[1,:]; color=:red, title="ICA Unmixed Signal 1")
p6 = plot(t,Sica[2,:]; color=:blue, title="ICA Unmixed Signal 2")
plot(p1, p2, p3, p4, p5, p6; layout=(3, 2))
end
Note that the unmixed signal does not have to have the same amplitude as the original signal for us to declare it as unmixed.
This is because of the normalization chosen in the ICA decomposition (of making the covariance matrix of the unmixed signal equal to the identity matrix). Consequently, the unmixed signal can also have flipped sign relative to the original signal.
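One way to make this precise: a recovered signal counts as unmixed if it matches the original up to sign and scale, which the magnitude of the correlation coefficient ignores. The sketch below uses a hypothetical recovered signal that is a sign-flipped, rescaled copy of the original:

```julia
using Statistics

t = range(-5, 5; length=1001)
s_true = sin.(2 .* t)                 # original signal
s_est  = -3.0 .* s_true               # "unmixed" signal: flipped sign, rescaled
# |correlation| is invariant to sign and amplitude, so it equals 1 here
@assert abs(cor(s_true, s_est)) ≈ 1.0
```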
We now compare ICA to PCA.
## TODO code here to unmix with PCA
t = -5:0.01:5
A = [1 0.5; 1 1]
@manipulate for s1type in ["sin","square","sawtooth"], s2type in ["cos","square","sawtooth"]
if s1type == "sin"
s1 = sin.(2 * t)
elseif s1type == "square"
s1 = square.(t, 1.25)
elseif s1type == "sawtooth"
s1 = sawtooth.(t, 2.3)
end
if s2type == "cos"
s2 = cos.(3 * t)
elseif s2type == "square"
s2 = square.(t)
elseif s2type == "sawtooth"
s2 = sawtooth.(t)
end
S = [s1 s2]'
Y = A * S;
Spca = pca_factorization(Y)[2]
p1 = plot(t,S[1,:]; color=:red, title="Signal 1")
p2 = plot(t,S[2,:]; color=:blue, title="Signal 2")
p3 = plot(t,Y[1,:]; color=:red, title="Mixed Signal 1")
p4 = plot(t,Y[2,:]; color=:blue, title="Mixed Signal 2")
p5 = plot(t,Spca[1,:]; color=:red, title="PCA Unmixed Signal 1")
p6 = plot(t,Spca[2,:]; color=:blue, title="PCA Unmixed Signal 2")
plot(p1, p2, p3, p4, p5, p6; layout=(3, 2))
end
Exercise:
Comment on the unmixing produced by PCA. Are there any pairs of signals that it does nearly unmix?
No, PCA doesn't unmix any of the signals well enough.
The reason PCA fails is that, fundamentally, the signals returned by PCA correspond to the right singular vectors of the SVD of the mean-centered matrix. Thus the PCA signals are intrinsically orthogonal to each other, whereas the signals we mixed together are not. PCA therefore returns linear combinations of the mixed signals that are orthogonal -- this fundamentally makes PCA unable to unmix such signals unless the signals were orthogonal to begin with and the mixing process produced orthogonal signals.
ICA, on the other hand, does not enforce orthogonality, and so it has the additional freedom to unmix signals that PCA does not. We will delve into this in greater detail in a subsequent codex.
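The orthogonality claim can be checked directly. The cell below mimics the assumed `pca_factorization` by taking the right singular vectors of the mean-centered mixture (a sketch, not the notebook's implementation) and confirms that the recovered signals are orthogonal even though the true sources are not:

```julia
using LinearAlgebra, Statistics

t = range(-5, 5; length=1001)
s1 = sin.(2 .* t)
s2 = cos.(3 .* t) .+ 0.4 .* s1        # deliberately not orthogonal to s1
S = [s1 s2]'
Y = [1.0 0.5; 1.0 1.0] * S            # mix the sources
Yc = Y .- mean(Y; dims=2)             # mean-center each row
Spca = svd(Yc).V'                     # PCA "unmixed" signals, one per row
@assert abs(dot(Spca[1, :], Spca[2, :])) < 1e-8  # orthogonal by construction
@assert abs(dot(S[1, :], S[2, :])) > 1.0         # the true sources are not
```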
There are many ways to factorize a data matrix. PCA and ICA are learned factorizations corresponding to specific loss functions and specific orthogonality constraints. We saw that PCA yields (an approximation of) the eigenvectors of the covariance matrix whereas ICA yields (as) independent (as possible) components.
Different projections reveal different information about the data -- as in this artwork by Sue Webster and Tim Noble

ICA and PCA yield different projections of the data and hence sometimes one is better than the others depending on what information is hidden in the data.
Finding patterns in data, even when we are seemingly given garbage in, is all about perspective and learning how to view things in just the right way.
The notion of statistical independence of a pair of deterministic images is not obvious. Statistical independence of two random variables is easier to visualize. We begin with a formal definition of independence.
Mathematically, $x$ and $y$ are statistically independent if the joint distribution $f(x,y)$ factorizes so that $f(x,y) = f(x)\,f(y)$. Independence implies uncorrelatedness but not vice versa -- uncorrelated variables need not be independent unless, for example, they are also jointly Gaussian.
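A quick numerical illustration of "uncorrelated but not independent": with $x$ standard normal, $y = x^2$ is essentially uncorrelated with $x$ yet completely determined by it. The dependence shows up in higher joint moments:

```julia
using Statistics

x = randn(100_000)
y = x .^ 2                            # a deterministic function of x
@assert abs(cor(x, y)) < 0.05         # (nearly) uncorrelated
# dependence in higher moments: E[x^2 y] = E[x^4] = 3, while E[x^2] E[y] = 1
@assert abs(mean(x .^ 2 .* y) - mean(x .^ 2) * mean(y)) > 1.0
```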
Let us first visualize the scatter plot of two independent random variables.
scatter(rand(1000), randn(1000); alpha=0.3, xlabel="x", ylabel="y")
An equivalent characterization of independence is that
$$f(y \mid x=x_0) = f(y).$$In other words, the conditional distribution of $y$ is the same for every value $x = x_0$. Thus, if we "scan" over the x-axis in the plot above, then $x$ and $y$ are (close to being) independent if the marginal distribution of the $y$ values along any vertical "slice" at $x = x_0$ is about the same for every non-trivial $x_0$.
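This "slice" idea can be turned into a crude numerical check. For independent $x$ and $y$, the $y$ values falling into any narrow $x$-bin should have (about) the same distribution; the sketch below compares the slice means:

```julia
using Statistics

x = rand(100_000)
y = randn(100_000)                    # independent of x by construction
# mean of y within each of five vertical slices of the x-axis
slice_means = [mean(y[(x .>= a) .& (x .< a + 0.2)]) for a in 0.0:0.2:0.8]
@assert maximum(abs.(slice_means)) < 0.05   # every slice has mean ≈ E[y] = 0
```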
We can now compare it to the scatter plot of two dependent random variables.
x = randn(1000)
y = randn(1000)
scatter(x, x + y; alpha=0.3, xlabel="x", ylabel="y")
Armed with this insight into gauging the independence of two variables, we can see if the original images are close to being independent using the command in the next cell.
using Random: randperm
subset = randperm(length(s1))[1:1000]
scatter(
s1[subset], s2[subset];
alpha=0.3,
xlabel="Image 1 pixel values",
ylabel="Image 2 pixel values"
)
Gaussian signals have an excess kurtosis of zero. Thus ICA cannot unmix mixtures of Gaussians. It can, however, unmix a mixture of more than two independent signals so long as at most one signal has an excess kurtosis of zero.
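For a zero-mean signal the excess kurtosis is $\mathbb{E}[x^4]/\mathbb{E}[x^2]^2 - 3$; the sketch below confirms it is near zero for a Gaussian and $-1.2$ for a zero-mean uniform signal, which is the contrast a kurtosis-based ICA exploits:

```julia
using Statistics

# excess kurtosis, assuming the signal has zero mean
excess_kurtosis(x) = mean(x .^ 4) / mean(x .^ 2)^2 - 3
g = randn(200_000)                    # Gaussian: excess kurtosis ≈ 0
u = rand(200_000) .- 0.5              # zero-mean uniform: excess kurtosis = -1.2
@assert abs(excess_kurtosis(g)) < 0.1
@assert abs(excess_kurtosis(u) + 1.2) < 0.1
```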
There are other variants of ICA that utilize higher order cumulants and other loss functions such as entropy. Each loss function comes with its set of unmixability (or identifiability) conditions.
We will investigate these further in a subsequent codex.
When ICA works, i.e., it successfully unmixes the signals, it is because the signals are approximately independent in a statistical sense. In the next cell, consider examples from above where ICA does work and show how you can visualize their relative approximate independence.
##TODO: Code for displaying approximate independence of images where ICA successfully unmixes images
##TODO: Code for displaying approximate independence of signals where ICA succesfully unmixes signals
## Load and mix the signals here
S = ??
subset = randperm(length(S[1, :]))[1:1000] ## select a random subset of the pixels to make plot less cluttered
scatter(S[1,subset],S[2,subset];
label="",
color=:green,
alpha = 0.1,
xlabel ="signal 1 values",
ylabel = "signal 2 values")
##TODO: Code displaying approximate independence of audio where ICA succesfully unmixes audio signals
# create mixed sources & save as .wav file
As a counterpoint to our discussion above, when ICA does not succeed in unmixing the signals, it is because the signals are statistically dependent. In the next cell, we consider an example from the signal unmixing where ICA does not work to illustrate how we can visually spot relative approximate dependence.
##TODO: Generate two signal waveforms (they cannot be the same waveform :-) -- mix them and show that ICA fails to recover them.
## Use the same visualization as before to illustrate visually the dependence structure
In many real-world applications signals are not completely independent or dependent. In the next cell, we consider an example where ICA does not seem to work.
image1 = "images/face1.jpg"; image2 = "images/face2.jpg"; ## what are the original faces?
# Load images
I1 = Float64.(Gray.(load(image1)))
I2 = Float64.(Gray.(load(image2)))
sizeI1 = size(I1)
S = [vec(I1) vec(I2)]';
I1, I2 = 0, 0
#gc()
# Mix images
A = [0.5 0.5; 0.5 -0.5];
Y = A*S
#gc()
mixed1 = reshape(Y[1, :]/maximum(Y[1, :]), sizeI1)
mixed2 = reshape(Y[2, :]/maximum(Y[2, :]), sizeI1)
## TODO: attempt to unmix
W,Sica = ica_factorization(Y)
unmixed1 = reshape(Sica[1,:],sizeI1)
unmixed2 = reshape(Sica[2,:],sizeI1)
Sica , Y = 0, 0
#gc()
p1 = imshow(mixed1,color=:grays, aspect_ratio=:equal, title = "Mixed Image 1 ")
p2 = imshow(mixed2,color=:grays, aspect_ratio=:equal, title = "Mixed Image 2")
p3 = imshow(unmixed1,color=:grays, aspect_ratio=:equal, title = "Unmixed Image 1")
p4 = imshow(unmixed2,color=:grays, aspect_ratio=:equal, title = "Unmixed Image 2")
plot(p1,p2,p3,p4,layout=4)
In the next plot we show a scatter plot of the pixel values of the images, revealing the strong correlation structure.
Exercise: What is the origin of the strong correlation? Hint: What does the alignment of the faces have to do with it?
subset = randperm(length(S[1,:]))[1:1000]
scatter(
S[1,subset], S[2,subset],
alpha = 0.15,
xlabel="Image 1 pixel values",
ylabel="Image 2 pixel values"
)
As a whole, the pixel values in the images are dependent because of the strong alignment of the faces. However, buried in this scatter plot is valuable information: there are ranges of pixel values in each image -- corresponding to a square (or rectangular) sub-region of the scatter plot -- within which the pixel values are almost independent.
Exercise:
Can you utilize this insight about the existence of a (nearly) independent sub-region within the images to successfully unmix them?
Hint: You will determine the unmixing matrix on just this (nearly) independent sub-region and then apply it to the whole image (which consists of mostly dependent parts).
## Hard Problem: Code to unmix images above --
## Hint: can we relax the notion of independent "everywhere" to independent "somewhere" and succeed?
## Hint: You will have to find the sub-region that is nearly independent, determine the unmixing matrix there
## and apply it to the rest of the image.
image1 = "images/face1.jpg"; image2 = "images/face2.jpg"; ## what are the original faces?
# Load images
I1 = Float64.(Gray.(load(image1)))
I2 = Float64.(Gray.(load(image2)))
sizeI1 = size(I1)
S = [vec(I1) vec(I2)]';
I1, I2 = 0, 0
#gc()
# Mix images
A = [0.5 0.5; 0.5 -0.5];
Y = A*S
#gc()
mixed1 = reshape(Y[1, :]/maximum(Y[1, :]), sizeI1)
mixed2 = reshape(Y[2, :]/maximum(Y[2, :]), sizeI1)
## TODO: attempt to unmix
## Learn the unmixing matrix on a (nearly) independent sub-region of the pixels only
W, Sica = ica_factorization(Y[:, 700:end])
X = inv(W)*Y
unmixed1 = reshape(X[1,:],sizeI1)
unmixed2 = reshape(X[2,:],sizeI1)
Sica , Y = 0, 0
#gc()
p1 = imshow(mixed1,color=:grays, aspect_ratio=:equal, title = "Mixed Image 1 ")
p2 = imshow(mixed2,color=:grays, aspect_ratio=:equal, title = "Mixed Image 2")
p3 = imshow(unmixed1,color=:grays, aspect_ratio=:equal, title = "Unmixed Image 1")
p4 = imshow(unmixed2,color=:grays, aspect_ratio=:equal, title = "Unmixed Image 2")
plot(p1,p2,p3,p4,layout=4)
image1 = "challenge_mixed1.png"
image2 = "challenge_mixed2.png"
I1 = Float64.(Gray.(load(image1)))
I2 = Float64.(Gray.(load(image2)))
plot(imshow(I1; color=:grays, aspect_ratio=:equal, title="Mixed Image 1 "),
imshow(I2; color=:grays, aspect_ratio=:equal, title="Mixed Image 2"))
##TODO: Unmix the images
Bonus question:
Who are these individuals?